Abstract

The fields of machine learning and mathematical optimization are increasingly intertwined. The special topic on supervised learning and convex optimization examines this interplay. The training part of most supervised learning algorithms can usually be reduced to an optimization problem that minimizes a loss between model predictions and training data. While most optimization techniques focus on accuracy and speed of convergence, the qualities of a good optimization algorithm from the machine learning perspective can be quite different, since machine learning is more than fitting the data. An optimization algorithm that is better at minimizing the training loss can give very poor generalization performance. In this paper, we examine a particular kind of machine learning algorithm, boosting, whose training process can be viewed as functional coordinate descent on the exponential loss. We study the relation between optimization techniques and machine learning by implementing a new boosting algorithm, DABoost, based on the dual-averaging scheme, and studying its generalization performance. We show that DABoost, although slower in reducing the training error, in general enjoys a better generalization error than AdaBoost.


1 Introduction 


Optimization formulations and methods lie at the heart of many machine learning algorithms (Murphy, 2012). A large number of machine learning algorithms reduce to optimization problems. For example, the support vector machine (SVM) (Hearst et al., 1998; Schölkopf & Smola, 1998; Steinwart & Christmann, 2008) minimizes the hinge loss between the training data and the model prediction. Latent models for sequential data (Rabiner & Juang, 1986; Bahl et al., 1986; Huang & Rao, 2011, 2014) maximize the conditional likelihood of the observed data. Logistic regression minimizes the negative log conditional likelihood of the training data given the model. Decision making problems (Boutilier, 2002; Levin et al., 1998; Huang et al., 2012; Huang & Rao, 2013) can be formulated in terms of maximizing the sum of future rewards. Optimization algorithms are thus widely used for training machine learning models. However, machine learning is more than simply a consumer of optimization techniques, since it is concerned not only with model training but also with model validation. The criterion used to validate the efficacy of a model is not the same as the criterion used for training it. In an optimization problem, the quality of a solution is measured by its speed (convergence rate) and accuracy (objective gap). In machine learning, however, generalization performance is perhaps the most important metric of solution quality. An optimization algorithm that has a poor convergence rate may achieve high generalization performance when it is applied to train a machine learning model. Machine learning therefore presents new challenges to mathematical optimization. Which properties of an optimization algorithm are desirable from the machine learning perspective is still an open question. In this paper, we study boosting (Freund & Schapire, 1997b; Freund et al., 1999), a machine learning method that is famous for its resistance to over-fitting. For example, many of the top-ranked entries in the HiggsML Challenge on Kaggle used the boosting library XGBoost (Chen et al., 2013). We formulate boosting as an optimization problem on the exponential loss and show its equivalence to gradient descent. We introduce a novel variant of the boosting algorithm, based on a more recent optimization scheme called the dual averaging method, that minimizes the same exponential loss function. We examine the performance of the standard boosting algorithm and ours from both an optimization perspective and a machine learning perspective.


Boosting is a general method for deriving a strong learner from weak learning algorithms. The boosting method is based on the observation that finding base (weak) learners that perform just slightly better than random guessing can be a lot easier than finding a single, highly accurate one. Kearns & Valiant (1988) first posed the question of whether a combination of base learners can be boosted into an arbitrarily accurate strong learner in the framework of the PAC (probably approximately correct) learning model. Freund & Schapire (1997a) introduced the first practical boosting algorithm for binary classification, called AdaBoost, which repeatedly calls a base learning algorithm to train different classifiers, each fitting training examples re-sampled from a different distribution. At each round, AdaBoost assigns larger weights to the harder examples. This effectively forces the base learning algorithm to focus its attention on the examples that were misclassified by the preceding classifier, and to come up with a new classifier that is hopefully more accurate. AdaBoost then combines those weak classifiers by simply taking a weighted majority vote of their predictions. AdaBoost has been shown to give significant accuracy improvements over base learning algorithms.


Looking to extend the success of AdaBoost, many successful attempts have been made to provide general algorithms for boosting. Breiman (1999) and Mason et al. (1999) made the crucial links between AdaBoost and optimization by reformulating AdaBoost from a gradient descent point of view on an exponential loss function. This intuitive connection was further developed, and many variants of AdaBoost were seen as performing gradient descent, but with different loss functions and different gradient descent methods. Furthermore, as pointed out by Grubb & Bagnell (2011) in their recent work, the existing gradient-based boosting algorithms can fail to converge on some non-smooth convex objectives. To address this issue, they presented new algorithms that can be extended to arbitrary convex loss functions with convergence guarantees. However, one limitation of these existing algorithms is that they compute each new classifier based only on the sub-gradient of the loss function at the previous iteration. It is known (Xiao, 2010) that this kind of gradient descent method is poor at exploiting the structure of the feasible set, especially when the loss function has an additional regularization term such as the $\ell_1$ norm for promoting sparsity. In this work, we apply a gradient descent method that involves the running average of all past sub-gradients of the loss function (known as the dual averaging method; Nesterov, 2009; Baes & Bürgisser, 2011) to the boosting framework. In addition, we study the convergence behavior of the proposed algorithm. Finally, we present experimental results that support our analysis, and examples that show the need for the new algorithm based on the dual-averaging scheme.


The remainder of this paper is organized as follows. Section 2 first describes the AdaBoost algorithm and formulates it as gradient descent on the exponential loss. In Section 3 we introduce the dual averaging method and show how to implement it in the boosting setting. We compare the performance of both boosting algorithms in Section 4. We conclude with a discussion in Section 5.


2 AdaBoost 

2.1 Algorithm description 


Algorithm 1 AdaBoost 

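Initialize $D_1(i) = 1/n$ for all $i = 1, \ldots, n$.

for $t = 1, \ldots, T$ do

Train the weak classifier $h_t$ with smallest training error $\epsilon_t$ with respect to $D_t$.
Choose $\eta_t = \frac{1}{2} \log \frac{1 - \epsilon_t}{\epsilon_t}$.

Let $Z_t = 2\sqrt{\epsilon_t (1 - \epsilon_t)}$ be a normalization factor so that $D_{t+1}$ will be a distribution.
Update $D_{t+1}(i) = D_t(i) \exp(-\eta_t y_i h_t(x_i)) / Z_t$, for all $i$.
The final classifier is $f_t = f_{t-1} + \eta_t h_t = \sum_{s=1}^{t} \eta_s h_s$.

end for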


The AdaBoost algorithm (shown in Algorithm 1) is arguably one of the most crucial developments in machine learning in the past two decades. AdaBoost can train classifiers with extremely small generalization errors from base learners as weak as decision stumps or as strong as neural networks. Let $\{(x_i, y_i)\}_{i=1,\ldots,n}$ be the training set, where the training instance $x_i \in \mathcal{X}$ and the training label $y_i \in \{-1, 1\}$. AdaBoost calls a given weak or base learning algorithm repeatedly in a series of rounds $t = 1, \ldots, T$. AdaBoost maintains a distribution $D_t$ over the training set, where $D_t(i)$ represents the weight of this distribution on training example $i$ on round $t$. Initially the distribution is uniform and all weights are equal. On each round, AdaBoost increases the weights of the examples misclassified by the previous classifier and decreases the weights of the correctly classified ones. In this way, the weak learning algorithm is forced to focus on the hard examples in the training set. The job of the weak learner is to find a classifier $h_t$ that minimizes the training error $\epsilon_t$ with respect to the distribution $D_t$:

$$\epsilon_t = \Pr_{i \sim D_t}\left[ y_i \neq h_t(x_i) \right] = \sum_{i : h_t(x_i) \neq y_i} D_t(i) \tag{1}$$

In practice, the weak learner may be an algorithm that can make use of the weights $D_t$ on the training examples, or it may be trained on a subset of the training examples re-sampled according to $D_t$. Once the weak classifier has been trained, AdaBoost chooses a parameter $\eta_t = \frac{1}{2} \log \frac{1 - \epsilon_t}{\epsilon_t}$ that measures the importance assigned to $h_t$. Note that $\eta_t > 0$ if $\epsilon_t < 1/2$, and the smaller $\epsilon_t$ gets, the larger $\eta_t$ becomes. The final classifier $f_T$ is a weighted majority vote of the $T$ weak classifiers, where $\eta_t$ is the weight assigned to $h_t$.
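
The loop of Algorithm 1 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation used in this paper: the base learner is an exhaustive search over decision stumps, the weights $D_t$ are passed to the learner directly rather than through re-sampling, and the helper names (`best_stump`, `stump_predict`, `adaboost`, `predict`) are hypothetical.

```python
import numpy as np

def stump_predict(X, j, thr, pol):
    # Decision stump on feature j: predicts -pol when x_j <= thr, else +pol.
    return np.where(X[:, j] <= thr, -pol, pol)

def best_stump(X, y, D):
    # Exhaustive search for the stump with the smallest weighted error under D.
    best = (np.inf, 0, 0.0, 1)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for pol in (1, -1):
                err = D[stump_predict(X, j, thr, pol) != y].sum()
                if err < best[0]:
                    best = (err, j, thr, pol)
    return best

def adaboost(X, y, T=10):
    n = len(y)
    D = np.full(n, 1.0 / n)                      # D_1(i) = 1/n
    ensemble = []
    for _ in range(T):
        eps, j, thr, pol = best_stump(X, y, D)
        eps = np.clip(eps, 1e-12, 1 - 1e-12)     # guard against log(0) (assumption)
        eta = 0.5 * np.log((1 - eps) / eps)      # importance of h_t
        h = stump_predict(X, j, thr, pol)
        D = D * np.exp(-eta * y * h)             # up-weight mistakes, down-weight hits
        D /= D.sum()                             # renormalize (the role of Z_t)
        ensemble.append((eta, j, thr, pol))
    return ensemble

def predict(ensemble, X):
    # Weighted majority vote of the weak classifiers.
    f = sum(eta * stump_predict(X, j, thr, pol) for eta, j, thr, pol in ensemble)
    return np.where(f >= 0, 1, -1)
```

On a one-dimensional "interval" labeling such as `y = [1, 1, -1, -1, 1, 1]` over `x = 0, ..., 5`, no single stump is correct everywhere, but the weighted vote of three stumps is.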

2.2 A gradient descent view 

Here we describe the general boosting algorithm as gradient descent in function space. We consider functions $f : \mathcal{X} \to \{-1, 1\}$ in the function space $L^2(\mathcal{X}, \mu)$, that is, functions whose Lebesgue integral $\int |f(x)|^2 \, d\mu$ is finite. The domain $\mathcal{X}$ is measurable and $\mu$ is a probability measure $P$, with the empirical probability distribution estimated from the training instances. The inner product in this Hilbert space can be written as:

$$\langle f, g \rangle_P = \frac{1}{n} \sum_{i=1}^{n} \langle f(x_i), g(x_i) \rangle. \tag{2}$$

This definition of $f$ covers a great variety of machine learning models, ranging from multi-layer perceptrons to decision trees. Under the framework of empirical risk minimization, we would like to employ the gradient descent algorithm to minimize the empirical risk of $f$, which is a functional $\mathcal{R}_{\mathrm{emp}} : L^2 \to \mathbb{R}$:

$$\mathcal{R}_{\mathrm{emp}}[f] = \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), y_i), \tag{3}$$

where $\ell$ is the loss function that measures the difference between the prediction $f(x_i)$ and the true label $y_i$. The gradient of $\mathcal{R}_{\mathrm{emp}}$ with respect to a function $f$ is another function $g$ that makes $\mathcal{R}_{\mathrm{emp}}[f + \eta g]$ change the most rapidly:

$$g(x) = \nabla \mathcal{R}_{\mathrm{emp}}[f](x) = \left. \frac{\partial \mathcal{R}_{\mathrm{emp}}[f + \eta \mathbf{1}_x]}{\partial \eta} \right|_{\eta = 0} \tag{4}$$

where $\mathbf{1}_x$ is the indicator function of $x$. In contrast to the standard gradient descent algorithm, boosting restricts the allowable descent directions to a feasible set $\mathcal{H}$, which corresponds directly to the set of hypotheses generated by the base learner. $\mathcal{H}$ can be the set of all possible decision trees generated by the C4.5 algorithm, or the set of all possible support vector machines. Given $\mathcal{H}$, we would like to find the hypothesis $h^*$ that is closest to the computed negative gradient. $h^*$ can then be found by projecting the negative gradient onto $\mathcal{H}$:

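$$h^* = \arg\max_{h \in \mathcal{H}} \langle -\nabla \mathcal{R}_{\mathrm{emp}}[f], h \rangle_P. \tag{5}$$

Finally, the gradient descent algorithm chooses a step size $\eta_t$ such that the empirical risk at the updated function, $\mathcal{R}_{\mathrm{emp}}[f + \eta_t h^*]$, is minimized.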
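
One round of this projection-and-line-search scheme can be sketched as follows. The sketch makes several assumptions that are not part of the text: the feasible set $\mathcal{H}$ is a finite pool of hypotheses already evaluated on the training set (the rows of `H`), the loss and its derivative are supplied by the caller, and the step size is found by a coarse grid search. The function names are hypothetical.

```python
import numpy as np

def exp_loss(f, y):
    # Per-example exponential loss.
    return np.exp(-y * f)

def exp_dloss(f, y):
    # Derivative of the exponential loss with respect to the prediction f(x_i).
    return -y * np.exp(-y * f)

def gradient_projection_step(f, y, H, loss, dloss):
    """One round: project the negative gradient onto a finite pool H, then line-search.

    f : current predictions f(x_i), shape (n,)
    y : labels in {-1, +1}, shape (n,)
    H : candidate hypotheses evaluated on the data, shape (m, n)
    """
    n = len(y)
    neg_grad = -dloss(f, y) / n            # negative functional gradient at each x_i
    scores = H @ neg_grad                  # inner product with each h, up to a constant
    k = int(np.argmax(scores))             # closest hypothesis in the pool
    h = H[k]
    etas = np.linspace(0.0, 5.0, 5001)     # grid line search (assumption)
    risks = [loss(f + eta * h, y).mean() for eta in etas]
    eta = float(etas[int(np.argmin(risks))])
    return k, eta, f + eta * h
```

Swapping `exp_loss` and `exp_dloss` for another differentiable loss changes the boosting variant without touching the projection step.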

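For AdaBoost, it can be shown that $\mathcal{R}_{\mathrm{emp}}[f] = \frac{1}{n} \sum_i \exp(-y_i f(x_i))$. We have

$$\nabla \mathcal{R}_{\mathrm{emp}}[f](x) = -y_i \exp(-y_i f(x_i)) \, \delta_{x, x_i} \tag{6}$$

where $\delta_{x, x_i} = 1$ if $x = x_i$, otherwise $\delta_{x, x_i} = 0$. Finding the closest hypothesis $h^*$ from $\mathcal{H}$ is equivalent to maximizing

$$\begin{aligned}
\langle -\nabla \mathcal{R}_{\mathrm{emp}}[f_{t-1}], h \rangle_P &= \frac{1}{n} \sum_i y_i h(x_i) \exp(-y_i f_{t-1}(x_i)) \\
&= \sum_i y_i h(x_i) D_t(i) \prod_{s=1}^{t-1} Z_s \\
&\propto [1 - 2\epsilon_t]
\end{aligned}$$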
Algorithm 2 Gradient Projection Algorithm
Given starting point $f_0$.
for $t = 1, \ldots, T$ do

Compute the gradient $\nabla \mathcal{R}_{\mathrm{emp}}[f_{t-1}]$.

Find $h^* = \arg\max_{h \in \mathcal{H}} \langle -\nabla \mathcal{R}_{\mathrm{emp}}[f_{t-1}], h \rangle_P$.
Find a step size $\eta_t = \arg\min_{\eta} \mathcal{R}_{\mathrm{emp}}[f_{t-1} + \eta h^*]$.
Update $f_t = f_{t-1} + \eta_t h^*$.

end for


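where $\epsilon_t = \Pr_{i \sim D_t}[y_i \neq h(x_i)]$ and

$$D_t(i) = \frac{\exp(-y_i f_{t-1}(x_i))}{n \prod_{s=1}^{t-1} Z_s} = D_{t-1}(i) \, \frac{\exp(-\eta_{t-1} y_i h^*_{t-1}(x_i))}{Z_{t-1}} \tag{7}$$

$$Z_t = 2\sqrt{\epsilon_t (1 - \epsilon_t)} \tag{8}$$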

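This projection step is equivalent to finding a base hypothesis with the smallest misclassification error $\epsilon_t$ over a boosted training set with distribution $D_t$. Finally, we choose the step size $\eta_t$ such that

$$\begin{aligned}
\frac{\partial \mathcal{R}_{\mathrm{emp}}[f_{t-1} + \eta h]}{\partial \eta} &= -\frac{1}{n} \sum_{i=1}^{n} y_i h(x_i) \exp(-y_i f_{t-1}(x_i) - y_i h(x_i) \eta) && (9) \\
&\propto -\sum_{i=1}^{n} y_i h(x_i) D_t(i) \exp(-y_i h(x_i) \eta) && (10) \\
&= \epsilon_t \exp(\eta) - (1 - \epsilon_t) \exp(-\eta) = 0 && (11)
\end{aligned}$$

$$\eta_t = \frac{1}{2} \log \frac{1 - \epsilon_t}{\epsilon_t} \tag{12}$$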

Therefore, with the exponential loss function, the gradient projection algorithm (Algorithm 2) is equivalent to the AdaBoost algorithm (Algorithm 1). By viewing AdaBoost as gradient descent in function space, it is tempting to conclude that AdaBoost is just an algorithm for minimizing the exponential loss, and that more (less) powerful optimization techniques for the same loss should work even better (worse). In the next sections, we test this conclusion by introducing a new variant of boosting that implements a different optimization technique.
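
The two facts used in this derivation, that the projection score equals $1 - 2\epsilon_t$ and that the closed-form step size minimizes the weighted exponential loss with minimum value $Z_t$, can be checked numerically. The snippet below is a sanity check on synthetic data rather than part of the method: the labels, the weak classifier `h` (about 70% accurate) and the distribution `D` are all randomly generated assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
y = rng.choice([-1, 1], size=n)                 # synthetic labels
h = np.where(rng.random(n) < 0.7, y, -y)        # hypothetical weak classifier
D = rng.random(n)
D /= D.sum()                                    # arbitrary distribution over examples

eps = D[h != y].sum()                           # weighted error, as in equation (1)
edge = np.sum(D * y * h)                        # projection score sum_i D(i) y_i h(x_i)

def weighted_exp_loss(eta):
    # sum_i D(i) exp(-eta * y_i * h(x_i)); its derivative matches equation (10) up to sign.
    return np.sum(D * np.exp(-eta * y * h))

eta_closed = 0.5 * np.log((1 - eps) / eps)      # equation (12)
grid = np.linspace(0.0, 3.0, 3001)
eta_grid = grid[np.argmin([weighted_exp_loss(e) for e in grid])]
Z = weighted_exp_loss(eta_closed)               # should equal 2 * sqrt(eps * (1 - eps))

print(edge, 1 - 2 * eps)
print(eta_grid, eta_closed)
print(Z, 2 * np.sqrt(eps * (1 - eps)))
```

Each printed pair should agree, the second one up to the resolution of the grid.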


3 Boosting with dual averaging method 


Algorithm 3 Dual Averaging Algorithm 

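Given objective function $R[f]$ and regularization function $d(f)$.
for $t = 1, \ldots, T$ do
Compute gradient $g_t = \nabla_f R[f_t]$.

Choose $\lambda_t > 0$, set $s_{t+1} = s_t + \lambda_t g_t$.
Choose $\beta_{t+1}$. Set

$h^* = \pi_{\beta_{t+1}}(-s_{t+1})$

where

$\pi_{\beta}(s) = \arg\min_{h \in \mathcal{H}} \{ \langle s, h \rangle + \beta d(h) \}$.

Update $f_{t+1} = f_t + \alpha_{t+1} h^*$.

end for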


The dual averaging method (shown in Algorithm 3) was recently introduced in convex optimization by Nesterov (2009). Baes & Bürgisser (2011) proposed an alternative viewpoint of the Hedge algorithm using the dual averaging method. The Hedge algorithm is known for its close relation to the AdaBoost algorithm. However, the Hedge algorithm and the AdaBoost algorithm differ in several ways. First, in the Hedge algorithm (see Baes & Bürgisser (2011) for more details), the weight $D_t(i)$ increases if the $i$th strategy is a "good" action at round $t$, while in AdaBoost the weight $D_t(i)$ increases if the $t$th hypothesis makes a "bad" prediction on the $i$th example. The loss in the Hedge algorithm that measures the success of a strategy is actually a measurement of the hardness of an example in AdaBoost. Thus, an algorithm that minimizes the loss in the Hedge algorithm will not minimize the training error $\epsilon_t$ in AdaBoost. Second, and more importantly, in AdaBoost the updating rule for the weights is $D_{t+1}(i) \propto D_t(i) \exp(-\eta_t y_i h_t(x_i))$, with a time-varying parameter $\exp(-\eta_t)$ that changes at each iteration according to the training error $\epsilon_t$. In the Hedge algorithm, by contrast, this parameter is fixed ahead of time.¹ Therefore, the algorithm described in Baes & Bürgisser (2011) cannot be directly applied to a boosting algorithm. Instead, we need to design a novel dual averaging algorithm for the boosting setting.
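
For readers unfamiliar with dual averaging, the following sketch shows the scheme in its simplest Euclidean form, outside the boosting setting. It is an assumption-laden illustration and not the variant developed in this paper: the domain is unconstrained, $\lambda_t = 1$, $\beta_t = \gamma \sqrt{t}$, and the regularization function is $d(x) = \frac{1}{2} \|x - x_0\|^2$, so the minimization step has a closed form. The function name `dual_averaging` is hypothetical.

```python
import numpy as np

def dual_averaging(subgrad, x0, T, gamma=1.0):
    """Simple dual averaging with lambda_t = 1 and beta_t = gamma * sqrt(t).

    subgrad : function returning a subgradient of the objective at x
    x0      : starting point, also the center of d(x) = 0.5 * ||x - x0||^2
    Returns the average of the iterates x_1, ..., x_T.
    """
    x0 = np.asarray(x0, dtype=float)
    x = x0.copy()
    s = np.zeros_like(x0)            # dual variable: running sum of subgradients
    x_sum = np.zeros_like(x0)
    for t in range(1, T + 1):
        x_sum += x
        s += subgrad(x)              # s accumulates ALL past subgradients
        beta = gamma * np.sqrt(t)
        x = x0 - s / beta            # argmin_x <s, x> + beta * d(x), in closed form
    return x_sum / T
```

For example, with the non-smooth objective $\|x - c\|_1$ and `subgrad = lambda x: np.sign(x - c)`, the averaged iterate approaches `c`. The key contrast with plain subgradient descent is that each new iterate depends on the whole history `s`, not only on the most recent subgradient.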


To correctly apply the dual averaging method in the boosting setting, we need to define the dual variable $s_t = \sum_{k=1}^{t} \lambda_k g_k$ in the functional space:

$$s_t = \sum_{k=1}^{t} -\lambda_k y_i \exp(-y_i f_k(x_i)) \, \delta_{x, x_i}. \tag{13}$$

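Because the hypothesis class $\mathcal{H}$ applies to arbitrary weak learners, it is not clear how to define a regularization function for rule-based learners (like zero-R) and tree-based learners (such as decision stumps and CART). We therefore let $d(h) = 0$; finding a weak classifier $h^*$ can then be written as:

$$\begin{aligned}
h^* &= \arg\min_{h \in \mathcal{H}} \; -\sum_i \frac{y_i h(x_i)}{n} \sum_{k=1}^{t} \lambda_k \exp[-y_i f_k(x_i)] \\
&= \arg\min_{h \in \mathcal{H}} \; -\sum_i y_i h(x_i) D_t(i) \\
&= \arg\min_{h \in \mathcal{H}} \; -[1 - 2\epsilon_t]
\end{aligned} \tag{14}$$

that is, $h^*$ minimizes the weighted error $\epsilon_t = \Pr_{i \sim D_t}[y_i \neq h(x_i)]$, where the probability (weight) for each instance is

$$D_t(i) = \frac{\sum_{k=1}^{t} \lambda_k \exp[-y_i f_k(x_i)]}{\sum_{j=1}^{n} \sum_{k=1}^{t} \lambda_k \exp[-y_j f_k(x_j)]} \tag{15}$$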


Algorithm 4 DABoost 

Given training samples $(x_1, y_1), \ldots, (x_n, y_n)$.

Initialize $D_1 = \frac{1}{n}$.

for $t = 1, \ldots, T$ do

Re-sample training data from $D_t$.

Find the weak classifier $h_t$ that minimizes the error rate $\epsilon_t$.
Set the step size $\alpha_t = \frac{1}{2} \log \frac{1 - \epsilon_t}{\epsilon_t}$.

Update distribution $D_{t+1}(i) \propto \sum_k \lambda_k \exp[-y_i f_k(x_i)]$.

end for

Final classifier $f(x) = \sum_t \alpha_t h_t(x)$.


Equation (15) defines a way to update the distribution $D_t$ in the dual averaging setting. We choose the step size $\eta_t$ as in equation (12), $\eta_t = \frac{1}{2} \log \frac{1 - \epsilon_t}{\epsilon_t}$. Finally, we introduce a novel boosting algorithm, called DABoost (shown in Algorithm 4), based on the dual averaging method.

Based on Algorithm 4, we implement DABoost in the WEKA environment so that it can call essentially any existing machine learning algorithm as the base learner. We fix the time-dependent importance parameter $\lambda_t = 1$ to be constant in the implementation.
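
To make the difference from AdaBoost concrete, the loop of Algorithm 4 can be sketched in Python. This is not the WEKA implementation described above; it is a minimal sketch under several assumptions: decision stumps found by exhaustive search serve as the base learner, the weights $D_t$ are passed to the learner directly instead of re-sampling, $\lambda_k = 1$, and $D_{t+1}$ is taken proportional to the sum of $\exp[-y_i f_k(x_i)]$ over the classifiers $f_1, \ldots, f_t$ built so far. All function names are hypothetical.

```python
import numpy as np

def stump_predict(X, j, thr, pol):
    # Decision stump on feature j: predicts -pol when x_j <= thr, else +pol.
    return np.where(X[:, j] <= thr, -pol, pol)

def best_stump(X, y, D):
    # Exhaustive search for the stump with the smallest weighted error under D.
    best = (np.inf, 0, 0.0, 1)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for pol in (1, -1):
                err = D[stump_predict(X, j, thr, pol) != y].sum()
                if err < best[0]:
                    best = (err, j, thr, pol)
    return best

def daboost(X, y, T=10):
    n = len(y)
    D = np.full(n, 1.0 / n)                      # D_1 = 1/n
    f = np.zeros(n)                              # f_t evaluated on the training set
    loss_sum = np.zeros(n)                       # sum_k exp(-y_i f_k(x_i)), lambda_k = 1
    ensemble = []
    for _ in range(T):
        eps, j, thr, pol = best_stump(X, y, D)
        eps = np.clip(eps, 1e-12, 1 - 1e-12)     # guard against log(0) (assumption)
        alpha = 0.5 * np.log((1 - eps) / eps)    # step size, as in equation (12)
        f += alpha * stump_predict(X, j, thr, pol)
        loss_sum += np.exp(-y * f)               # accumulate the loss of every round
        D = loss_sum / loss_sum.sum()            # averaged update, as in equation (15)
        ensemble.append((alpha, j, thr, pol))
    return ensemble

def predict(ensemble, X):
    f = sum(a * stump_predict(X, j, thr, pol) for a, j, thr, pol in ensemble)
    return np.where(f >= 0, 1, -1)
```

The only line that differs in substance from an AdaBoost loop is the update of `D`: AdaBoost uses the exponential loss of the latest classifier alone, whereas here the losses of all previous rounds are summed, so `D` moves more slowly as `T` grows.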


¹ The parameter is denoted by $\gamma$ in Baes & Bürgisser (2011).
Figure 1: Training and test errors of AdaBoost and DABoost, both using stumps as the base learner, on the data sets (a) Heart Disease and (b) Ionosphere. DABoost tends to produce classifiers with higher bias but less variance.


4 Results 


In this section, we evaluate our DABoost algorithm using different data sets from the UCI machine learning repository (Lichman, 2013), and compare its training error and test error to those of AdaBoost. Because the training part of both AdaBoost and DABoost can be viewed as convex optimization on the exponential loss, the training error represents the objective gap of the loss function, a measurement of quality from the optimization perspective, while the test error represents the generalization performance, a measurement of quality from the machine learning perspective.


We start with a toy example where the instance $x$ is drawn uniformly from $[-1, 1]^{100}$, and the label $y$ is the majority vote of three coordinates. The size of the training set is $n = 1{,}000$. We use this data set to test the correctness of our DABoost algorithm. With this simple data set, both DABoost and AdaBoost (boosting stumps) achieve 0% training error after three iterations, as expected. In comparison, popular machine learning algorithms such as SVM, logistic regression and the multilayer perceptron all score test errors greater than 15% (note that tree-based algorithms such as CART and C4.5 can also score 0% test error).
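
A data set of this kind can be generated as follows. The construction is a guess at the details the text leaves open: the three voting coordinates are taken to be the first three, and each coordinate is assumed to vote with its sign. The function name `make_majority_data` is hypothetical.

```python
import numpy as np

def make_majority_data(n=1000, dim=100, voters=(0, 1, 2), seed=0):
    # x is uniform on [-1, 1]^dim; y is the majority of the signs of three coordinates.
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, size=(n, dim))
    votes = np.where(X[:, list(voters)] > 0, 1, -1)
    y = np.where(votes.sum(axis=1) > 0, 1, -1)
    return X, y

X, y = make_majority_data()

# Three stumps, one per voting coordinate with threshold 0, combined by an
# unweighted vote, reproduce the labels exactly.
stump_votes = np.where(X[:, :3] > 0, 1, -1)
ensemble_pred = np.where(stump_votes.sum(axis=1) > 0, 1, -1)
train_error = np.mean(ensemble_pred != y)
```

Under these assumptions the target is itself a vote of three stumps, which is consistent with boosted stumps reaching zero training error in three rounds, while a single linear decision boundary cannot represent the majority function exactly.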


Figure 1(a) shows the performance of AdaBoost and DABoost, both using stumps as the base learner, on the Heart Disease data set. The blue curves represent the results from AdaBoost while the green curves represent the results from DABoost. Training errors are shown as dashed lines while test errors are shown as solid lines. The Heart Disease data set contains 14 attributes. The label refers to the presence of heart disease in the patient. The Heart Disease data set is one of the few training sets on which the AdaBoost algorithm overfits. As shown in the figure, DABoost converges slowly in terms of training error but suffers less over-fitting in terms of test error.

Figure 1(b) shows the performance of AdaBoost and DABoost, both boosting decision stumps, on the Ionosphere data set. The Ionosphere data set contains 34 continuous attributes. The label is either "good" or "bad", where "good" radar returns are those showing evidence of some type of structure in the ionosphere and "bad" returns are those that do not. The Ionosphere data set is commonly used in the machine learning literature. As shown in the figure, DABoost again converges more slowly than AdaBoost in terms of training error but achieves better test error.


In general, DABoost gets higher training error but enjoys lower test error. Similar behavior has been observed on many other data sets, such as the letter data set when boosting a C4.5 base learner, and the diabetes data set when boosting stumps. This is due to the bias-variance trade-off. DABoost updates the distribution $D_t$ based on an average of the loss over all previous iterations. The resulting classifier becomes less flexible because, as the number of iterations $t$ increases, the distribution $D_t$ changes slowly due to the update rule in equation (15). Thus, DABoost tends to produce classifiers with higher bias but less variance.

However, DABoost does not always have better generalization performance than AdaBoost. Figure 2 shows the performance of AdaBoost and DABoost, both using decision stumps as the base learner, on the Webb Spam Corpus data set. Web spam is defined as Web pages that are created to manipulate search engines and deceive Web users. All positive examples were taken, and the negative examples were created by randomly traversing the Internet starting at well known (e.g. news) web sites. In this data set, each single byte is treated as a word and the word count is used as the feature value. Each instance is normalized to unit length. The total number of features is 254. Due to memory constraints, only 1% of the instances (3,500 instances) are used for training. As shown in the figure, DABoost converges more slowly than AdaBoost in terms of training error but also suffers larger test error. This poorer performance is due to the noise in the highly biased labels in the data set.

Figure 2: Training and test errors of AdaBoost and DABoost, both using stumps as the base learner, on the Webb Spam Corpus data set.

5 Conclusion 

In this paper, we discuss the quality of a good optimization algorithm from both a machine learning perspective and a mathematical programming perspective. We postulated that an optimization algorithm that converges more slowly might result in a machine learning algorithm with better generalization performance. We tested this postulate by introducing a new variant of the boosting algorithm, DABoost, based on the dual averaging gradient descent method applied to the exponential loss. Our experimental results show that, although slower in obtaining a small training error, DABoost in general enjoys better generalization error than AdaBoost.

Our implementation of DABoost is still far from complete and calls for further research. We fix the time-dependent importance parameter $\lambda_t = 1$ to be constant in the implementation. A time-varying $\lambda_t$ might lead to different results. Moreover, we simplify the dual averaging algorithm by restricting the regularization function to $d(h) = 0$. $\ell_1$ or $\ell_2$ regularization might be applied to base learners that can be parametrized by a vector of real numbers.

DABoost is based on the dual averaging algorithm, a recently introduced convex optimization algorithm whose convergence rate is similar to that of gradient descent. In future work, more powerful optimization techniques, such as accelerated gradient descent with its faster $O(1/t^2)$ convergence rate on smooth convex objectives, might be implemented in the boosting framework.